Computational Methods for Qualitative Research in Criminology and Criminal Justice Studies

This article supplement, intended as a pedagogical tool, provides all of the code necessary to reproduce the case study illustration in Computational Methods for Qualitative Research in Criminology and Criminal Justice Studies (work in progress). To boost the pedagogical value of this resource, we have provided detailed explanations and commentaries throughout each step.

2021-08-12

Defining the Problem

Collecting

Website Recon

Index Scrape

The first step we took in our index scrape was to create a csv file on our local computer to store the results of our scrape. This is not the only way to do this, but it is the one we find most efficient. Another option would be to store the results in RStudio's global environment, saving them to your local computer after the scrape completes. Two major downsides to that approach are (1) you cannot see the results of the scrape until it is completed; and (2) if your scrape fails at some point (which it very likely will, especially on longer scrapes), you'll lose the results you had obtained up to that point. So, let's start by creating a csv spreadsheet that contains named columns for the data we'll be collecting (headline_url, headline_text, etc.) in our index scrape. To do this we'll use library(tibble), library(readr), and library(tidyr).

#install.packages("tibble")
#install.packages("readr")
#install.packages("tidyr")
library(tibble)
library(readr)
library(tidyr)
# give the file you'll be creating a name 
filename <- "rcmp-news-index-scrape.csv"

# using the tibble function, create a dataframe with column headers
create_data <- function(
    headline_url = NA,
    headline_text = NA,
    date_published = NA,
    metadata_text = NA,
    page_url = NA
  ) {
    tibble(
        headline_url = headline_url,
        headline_text = headline_text,
        date_published = date_published,
        metadata_text = metadata_text,
        page_url = page_url
    )
  }

# write tibble to csv
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)

Next, let's write the code for our index scraping algorithm, which will obtain the data from the RCMP's website and populate the csv file we just created in the last chunk of code. We'll need an additional library to do this – library(rvest) – which will be used to get and parse the data from the RCMP's website. To locate the information we want in the HTML, we'll specify the HTML elements that embed the content we are interested in obtaining (link to full article, date, headline text, etc.). Identifying these elements is a bit of an art. The developer tools built into every browser make it easy to find them. Another popular way of obtaining them is via the SelectorGadget browser plug-in.
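To see how these selectors behave before pointing them at the live site, we can test them on a small hand-written HTML fragment. The snippet below is invented for illustration (the real page is more complex), but the selector logic is the same:

```r
library(rvest)

# a hand-written fragment mimicking the nesting of the RCMP index page
snippet <- read_html('
  <div class="list-group">
    <div><div>
      <a href="/en/news/2021/example-release">Example headline</a>
      <span class="text-muted">2021-08-12 &#8212; Ottawa, Ontario</span>
    </div></div>
  </div>')

# ".list-group" matches by class; "div > div > a" follows the nesting
snippet %>%
  html_node('.list-group') %>%
  html_nodes('div > div > a') %>%
  html_attr('href')
#> [1] "/en/news/2021/example-release"
```

The same chain of html_node(), html_nodes(), and html_attr() calls is what we use against the live page.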

#install.packages("rvest")
library(rvest)
base_url <- 'https://www.rcmp-grc.gc.ca/en/news?page='

max_page_num <- NA # note that these pages are zero-indexed

scrape_page <- function(page_num = 0) {

  # grab html only once
  page_url <- paste(base_url, page_num, sep = '')

  curr_page <- read_html(page_url)

  # zero in on news list
  news_list <- curr_page %>%
    html_node('.list-group')

  # grab headline nodes
  headline_nodes <- news_list %>%
    html_nodes('div > div > a')

  # use headline nodes to get urls
  headline_url <- headline_nodes %>%
    html_attr('href') %>%
    url_absolute('https://www.rcmp-grc.gc.ca/en/news')

  # use headline nodes to get text
  headline_text <- headline_nodes %>%
    html_text(trim = TRUE)

  # grab metadata field
  metadata <- news_list %>%
    html_nodes('div > div > span.text-muted')

  # use metadata field to grab pubdate
  date_published <- metadata %>%
    html_nodes('meta[itemprop=datePublished]') %>%
    html_attr('content')

  # use metadata field to grab metadata text
  metadata_text <- metadata %>%
    html_text(trim = TRUE)

  # build a tibble
  page_data <- create_data(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )

  # write to csv
  write_csv(page_data, filename, append = TRUE)

  # determine the index of the last page from the pagination widget
  # (the pages are zero-indexed, so subtract 1 from the displayed total)
  max_page_num <- curr_page %>%
    html_node('div.contextual-links-region ul.pagination li:nth-last-child(2)') %>%
    html_text(trim = TRUE) %>%
    as.numeric() %>%
    {. - 1}

  Sys.sleep(3)

  # recur
  if ((page_num + 1) <= max_page_num) {
    scrape_page(page_num = page_num + 1)
  }

}

# run it once
scrape_page()

Let’s inspect the result. To do this we’ll use the paged_table function from library(rmarkdown).

#install.packages("rmarkdown")
library(rmarkdown)
index <- read_csv("rcmp-news-index-scrape.csv")

paged_table(index)

Contents Scrape I (text)

Using the results of the index scrape, we can conduct our contents scrape. This will involve visiting each url in our index (headline_url) and grabbing the content we want from each page. We're going to grab three things from each: the headline url (in order to merge the results of our index and contents scrapes), the full_text of the article, and the link to any image contained in the news release (if there is one). As we did with the index scrape, we'll start by creating a csv file on our local hard drive with named columns that correspond to the information we're going to collect.

filename <- 'rcmp-news-contents-scrape.csv'

create_data <- function(
  headline_url = NA,
  full_text = NA,
  image_url = NA
) {
  tibble(
    headline_url = headline_url,
    full_text = full_text,
    image_url = image_url
  )
}

# write once to create headers
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)

And now we can write the code for our scrape. To mix it up, we’ll use the lapply() function this time.

index_list <- as.list(index$headline_url)

lapply(index_list, function(i) { 
  
  webpage <- read_html(i)
  
  full_text <- html_node(webpage, ".node-news-release > div") %>% html_text(trim = TRUE)
  
  # default to NA so the check below still works if the image lookup fails
  image_url <- NA
  try(image_url <- html_node(webpage, ".img-responsive") %>% html_attr("src"))
  
  if(!is.na(image_url)){
    
    image_url <- image_url %>% url_absolute(i)
    
    }
  
  page_data <- create_data(
    headline_url = i,
    full_text = full_text,
    image_url = image_url
  )
  
  write_csv(page_data, filename, append = TRUE)
  
  Sys.sleep(3)
  
})

Finally, let's combine the results of our index and contents scrapes into a single dataframe. We'll save the combined csv file in our working directory.

# read in the two files
index_scrape <- read_csv("rcmp-news-index-scrape.csv")
contents_scrape <- read_csv("rcmp-news-contents-scrape.csv")

# combine the files by matching rows on the headline_url column
combined_df <- index_scrape %>% left_join(contents_scrape, by = "headline_url")

# save results
write_csv(combined_df, "rcmp-news-df.csv")
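To see what matching on a shared key buys us here, consider a toy version of the merge with invented values: dplyr's left_join() lines the rows up by headline_url, so the two files do not need to be in the same row order.

```r
library(dplyr)

# toy stand-ins for the two scrape results (values invented for illustration)
toy_index <- tibble(
  headline_url  = c("https://example.org/a", "https://example.org/b"),
  headline_text = c("Release A", "Release B")
)
toy_contents <- tibble(
  headline_url = c("https://example.org/b", "https://example.org/a"),  # reversed order
  full_text    = c("Body of B", "Body of A")
)

# left_join() matches rows on the shared key, so row order doesn't matter
left_join(toy_index, toy_contents, by = "headline_url")$full_text
#> [1] "Body of A" "Body of B"
```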

Parsing, Cleaning, and Exploring

#rm(list=ls()) # you may want to clear your global environment at this point

rcmp_news <- read_csv(here::here("rcmp-news-df.csv"))

Let's take a look at our DataFrame so far. For this we can use the head() function from base R to view only the first six rows.

paged_table(head(rcmp_news))

Inspecting the region information, we can see that the city/town/county and the province/territory are grouped together and separated by a ",". For example, on the first row we have "Iqaluit, Nunavut", and on the second row we have "Dauphin, Manitoba". As it is going to be useful for our analysis to explore patterns in the data by province/territory, we will want to split the region variable so that we have two variables instead: one that contains the town/city/county information, and one that contains the province/territory information. We can achieve this using the separate() function, which comes from library(tidyr). We will apply the separate() function to our region variable, saving the result into a new DataFrame (rcmp_news_pp, where pp stands for "pre-processed"). The separate() function requires you to provide the names of the new variables you'll be creating (i.e., what the information on the left and right of the "," will be saved into). We will call these "region1" (town/city/county) and "region2" (province/territory).
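Before applying separate() to the full DataFrame, here is what it does on a two-row toy example (using the two values quoted above):

```r
library(tibble)
library(tidyr)

toy <- tibble(region = c("Iqaluit, Nunavut", "Dauphin, Manitoba"))

# split on ", ", putting the left part in region1 and the right in region2
out <- separate(toy, region, into = c("region1", "region2"), sep = ", ")
out$region2
#> [1] "Nunavut"  "Manitoba"
```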

#install.packges("dplyr")
library(dplyr)
rcmp_news_pp <- rcmp_news %>%
  separate(metadata_text, c("date", "region"), sep = "— ") %>%
  select(-date) %>%
  separate(region, c("region1", "region2"), sep = ", ")

paged_table(head(rcmp_news_pp, n = 200)) #inspect the results, again using the head() function, but this time let's inspect the first 200 results

Inspecting the first 200 results, we see that it mostly worked, until about page 14. On the second row of page 14, in the region2 column, we can see that “Ontario Media advisory” has been entered where we expected the result to be just “Ontario”. Flipping through more of the pages, we see a similar labeling issue on page 18 (“Ontario NCR”, “Guysborough County”) and on page 19 (“Manitoba Statement”). At this stage we should ask: how many of the region2 values contain information other than just the name of the province/territory? The easiest way to do this is to use the count() function from library(dplyr) to count the total number of times that each unique entry appears in the region2 variable. From here we can print the results of our count in a table that we can manually inspect.

rcmp_region_count <- rcmp_news_pp %>%
  count(region2)
  
paged_table(rcmp_region_count)

Paging through the elements in this table, we can see two problems: 1) there are a lot of entries in the region2 variable that contain more or different information than the name of the province/territory; and 2) on the last page of the table, page 14, we see that there are 24 news releases in our corpus where the region2 value is NA, meaning it is missing. This brings us to our first major cleaning operation: fixing the entries in our region2 variable so that it only contains information on the province/territory of the RCMP detachment authoring the news release. To deal with this issue, we can write a custom function that re-labels each of the entries we want to re-label in region2. This is neither the fastest nor the most efficient way to achieve this result, but it is one of the easiest. (As this is a long chunk of code, we have hidden it from view. Click "Show code" to view.)

Show code
recode_regions <- function(region) {
  case_when(
    region %in% c(
      "Ontario National",
      "Ontario Statement",
      "Ontario Media advisory",
      "Ontario NCR",
      "Ontario National Statement",
      "Ontario National Media advisory",
      "Ontario National NCR",
      "Ontario National Statement NCR",
      "Ontario National Speech",
      "Ontario Statement NCR",
      "Ontario Speech",
      "Ontario Media advisory NCR"
    ) ~ "Ontario",
    region %in% c(
      "Saskatchewan National Speech",
      "Saskatchewan National Depot",
      "Saskatchewan Statement Depot",
      "Saskatchewan Media advisory",
      "Saskatchewan Depot",
      "Saskatchewan Media advisory Depot",
      "Saskatchewan National",
      "Saskatchewan National Statement",
      "Maidstone",
      "Shaunavon",
      "Saskatchewan Statement",
      "Archerwill",
      "Lake Diefenbaker",
      "Pelican Narrows",
      "Moose Jaw",
      "Weyburn",
      "Southend",
      "Emma Lake",
      "Biggar",
      "Langenburg"
    ) ~ "Saskatchewan",
    region %in% c(
      "Quebec National Media advisory",
      "Quebec National",
      "Quebec Media advisory"
    ) ~ "Quebec",
    region %in% c(
      "Nova Scotia Media advisory",
      "Nova Scotia National",
      "Halifax Regional Municipality",
      "Nova Scotia Speech",
      "Nova Scotia Statement",
      "Nova Scotia National Speech",
      "Queens and Kings Counties",
      "Victoria County",
      "Digby County",
      "Hants County",
      "Kings County",
      "Kings and Prince Counties",
      "Annapolis County",
      "Colchester County",
      "Antigonish County",
      "Queens and King counties",
      "Inverness County",
      "Queens County",
      "Lunenburg County",
      "Yarmouth County",
      "Antigonish County",
      "Hanty County",
      "Cumberland County",
      "Shelburne County",
      "Richmond County",
      "Green Creek",
      "Coichester County",
      "Richmond Co.",
      "Guysborough County",
      "Pictou County",
      "Annapolis COunty",
      "Annapolis Valley",
      "Hants Co."
    ) ~ "Nova Scotia",
    region %in% c(
      "Manitoba Statement",
      "Manitoba National",
      "Manitoba ",
      "Manitoba Media advisory",
      "Rosebank"
    ) ~ "Manitoba",
    region %in% c(
      "PEI",
      "Queens and Kings Districts",
      "Queens and Kings counties",
      "Queens and Kings County"
    ) ~ "Prince Edward Island",
    region %in% c(
      "N.B. ",
      "N.B.",
      "New Brunswick Statement",
      "Aroostook and Oxbow",
      "New Brunswick Media advisory",
      "NB"
    ) ~ "New Brunswick",
    region %in% c(
      "Yukon Media advisory",
      "Carcross",
      "Yukon ",
      "Whitehorse",
      "Yukon Statement",
      "Ross River",
      "Haines Junction",
      "Faro"
    ) ~ "Yukon",
    region %in% c(
      "Alberta Statement",
      "Alberta National",
      "Alberta Media advisory",
      "Alta.",
      "Alta",
      "Alberta Depot",
      "Alberta National Depot",
      "Alberta National Statement",
      "Three Hills and Stettler"
    ) ~ "Alberta",
    region %in% c(
      "Newfoundland and Labrador Media advisory",
      "Newfoundland and Labrador Statement",
      "Nain",
      "Stephenville",
      "Deer Lake",
      "Hopedale",
      "Ferryland and Stephenville",
      "Grand Falls-Windsor",
      "Holyrood and Stephenville"
    ) ~ "Newfoundland and Labrador",
    region %in% c(
      "British Columbia National",
      "Green Lake"
    ) ~ "British Columbia",
    region %in% c(
      "Nunavut Media advisory"
    ) ~ "Nunavut",
    region %in% c(
      "Northwest Territories Media advisory"
    ) ~ "Northwest Territories",
    TRUE ~ region
  )
}

Next we will apply our cleaning function to our DataFrame, using the mutate() function from library(dplyr). Before this, to deal with the 24 NA values we identified in region2, we are going to apply another dplyr function called coalesce(), which will replace the NA values in region2 with the values from region1. We are also going to add one additional step before applying our custom cleaning function: trimming any unnecessary white space from the beginning and end of each entry in region2. To do this, we use the str_trim() function from library(stringr). Finally, we'll apply our cleaning function and inspect the results to see if it worked.
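The behaviour of coalesce() and str_trim() can be previewed on a pair of small vectors (the values below are invented for illustration):

```r
library(dplyr)
library(stringr)

region1 <- c("Ottawa", "Whitehorse", "Moose Jaw")
region2 <- c("Ontario ", NA, " Saskatchewan")

# coalesce() fills each NA in its first argument with the value from the second;
# str_trim() then strips stray leading/trailing whitespace
str_trim(coalesce(region2, region1))
#> [1] "Ontario"      "Whitehorse"   "Saskatchewan"
```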

#install.packages("stringr")
library(stringr)
rcmp_news_pp <- rcmp_news_pp %>%
  mutate(region2 = coalesce(region2, region1)) %>%
  mutate(region2 = str_trim(region2)) %>%
  mutate(region2 = recode_regions(region2))

rcmp_region_count <- rcmp_news_pp %>%
  count(region2)
  
paged_table(rcmp_region_count)

It worked! Finally, let's rename our region1 and region2 variables, calling them town_city_county and prov_terr. To do this we can use the rename() function from library(dplyr).

rcmp_news_pp <- rcmp_news_pp %>%
  rename(town_city_county = region1,
         prov_terr = region2)

Now that we’ve fixed our issues with the province/territory data in our DataFrame, we can begin to explore the corpus. A great technique for getting to know the data in your corpus is to use data visualization. Combining standard data manipulation and data visualization techniques like counting and bar graphs, we can easily visualize some of the more macro-level patterns in our corpus (publication of news releases over time, by province/territory, etc.). To do this, we’ll be using an R data visualization library called library(ggplot2). Since data often needs to be manipulated to some degree before visualization (e.g., counting the number of unique provinces/territories in the DataFrame, in order to create a bar chart visually representing the differences), we’ll also be using more of library(stringr) and library(dplyr). Let’s begin with a bar graph showing the total number of press releases in our corpus by province/territory. Note that to produce these graphs, we are going to use the theme aesthetics from library(hrbrthemes) and the default colour palette from library(fishualize), so you’ll need to install these two libraries first before you can run the code.

#install.packages("ggplot2")
#install.packages("hrbrthemes")
#install.packages("fishualize")
library(ggplot2)
rcmp_news_pp %>%
  group_by(prov_terr) %>%
  count(name = "count") %>%
  ggplot(aes(x = reorder(prov_terr, count), y = count)) +
  geom_col(fill = "#ca1928", show.legend = FALSE) +
  hrbrthemes::theme_ipsum(grid="") +
  theme(axis.text.x = element_blank()) +
  geom_text(aes(label=scales::comma(count)), hjust=0, nudge_y=50) +
  expand_limits(y = c(0, 3400)) +
  coord_flip() +
  labs(x = "", y = "")

Nova Scotia, New Brunswick, and Newfoundland and Labrador account for the majority of news releases in the database we scraped. The two largest provinces in Canada - Ontario and Quebec - account for the least. Why? Here we can use some domain knowledge to interpret this: there are very few RCMP detachments in Ontario and Quebec, as both have provincial police forces (Ontario Provincial Police and Sûreté du Québec) that absorb many of the police duties that would otherwise be contracted to the RCMP. This fits clearly with what we know about policing in Canada. What is less immediately clear is why British Columbia only has two entries. Based again on our domain knowledge, we know that British Columbia has a strong RCMP presence (and unlike Ontario and Québec, does not have a provincial police force to act in lieu of the RCMP). So why the low number of news releases? …

Next, let's look at the proportion of news releases produced over time by province/territory. To calculate this, we are going to manipulate the date variable, removing the time-of-day information, in order to count the total number of news releases produced by each province/territory for each year-month combination (2016-05, 2016-06, 2016-07, etc.).
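The date manipulation can be previewed on a single invented timestamp in the site's "YYYY-MM-DD HH:MM:SS" format:

```r
library(stringr)
library(lubridate)

stamp <- "2021-08-12 14:30:05"  # invented example in the site's format

# strip the " HH:MM:SS" suffix, parse what's left as a Date,
# then keep only the year-month combination
date_only <- as.Date(ymd(str_remove_all(stamp, " (\\d{2})(\\:)(\\d{2})(\\:)(\\d{2})")))
format(date_only, "%Y-%m")
#> [1] "2021-08"
```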

rcmp_news_count <- rcmp_news_pp %>%
  filter(prov_terr != "British Columbia") %>%
  filter(!is.na(date_published)) %>%
  group_by(prov_terr, date_published) %>%
  mutate(date_published = str_remove_all(pattern= " (\\d{2})(\\:)(\\d{2})(\\:)(\\d{2})", date_published)) %>%
  mutate(date_published = as.Date(lubridate::ymd(date_published))) %>%
  count(name = "count", .drop = TRUE)

rcmp_news_count <- rcmp_news_count %>%
  mutate(date_published = format(date_published, "%Y-%m")) %>%
  mutate(date_published = paste(date_published, "-01", sep = "")) %>%
  mutate(date_published = as.Date(date_published, "%Y-%m-%d"))

rcmp_news_count <- rcmp_news_count %>%
  group_by(date_published, prov_terr) %>%
  summarize(count = sum(count)) %>%
  mutate(total = sum(count)) %>%
  mutate(prop = count/total) %>%
  filter(date_published != "2016-12-01")

rcmp_news_count %>%
  ggplot(aes(x = date_published, y = prop, fill = prov_terr)) +
  geom_col(position = "stack") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_date(breaks = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01", "2020-01-01")), date_labels = "%Y") +
  fishualize::scale_fill_fish_d() +
  hrbrthemes::theme_ipsum(grid = "") +
  labs(y = "Proportion", x = "",
       fill = "")

Now let’s do the same thing with the raw count rather than the proportion.

rcmp_news_count %>%
  ggplot(aes(x = date_published, y = count, color = prov_terr)) +
  geom_smooth(se = FALSE) +
  scale_y_continuous() +
  scale_x_date(breaks = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01", "2020-01-01")), date_labels = "%Y") +
  fishualize::scale_color_fish_d() +
  hrbrthemes::theme_ipsum(grid = "") +
  labs(y = "Number of news releases", x = "",
       color = "")

Finally, let's look at the distribution of total word count across each of the news releases in our corpus, again by province/territory. To produce this particular graph, we are going to use the geom_density_ridges() function from library(ggridges), an add-on to library(ggplot2).

#install.packages("ggridges")
library(ggridges)
rcmp_news_pp %>%
  filter(prov_terr != "British Columbia") %>%
  mutate(word_count = str_count(full_text, "\\w+")) %>%
  mutate(word_count_mean = mean(word_count, na.rm = TRUE)) %>%
  group_by(prov_terr) %>%
  ggplot(aes(x = word_count, y = prov_terr)) +
  geom_density_ridges(fill = "#ca1928", show.legend = FALSE) +
  geom_vline(aes(xintercept = word_count_mean), linetype="dotted", size = 1) +
  scale_x_log10(breaks = c(1, 10, 100, 1000, 5805)) +
  hrbrthemes::theme_ipsum(grid = "X") +
  labs(x = "word count (log scale)", y = "")

Sampling and Outputting

As noted at the outset, we are interested in analyzing not only the textual component of our RCMP news corpus, but the visual component as well. The first question we might ask, then, is how many of the news releases actually contain images? To calculate this, we can create a table with four columns: one that indicates the province/territory of the RCMP detachment, one that indicates the total number of news releases in the corpus, one that indicates the total number of news releases in the corpus containing images, and finally, one that contains the percentage of press releases that contain images (for each province/territory). We can create this table using a number of helpful functions from library(dplyr), including filter() (to take out British Columbia), add_count() (to count the total number of press releases and the number of press releases containing images), and mutate() (to calculate the percentage of news releases containing images), among others.

table1 <- rcmp_news_pp %>%
  filter(prov_terr != "British Columbia") %>%
  group_by(prov_terr) %>%
  add_count(prov_terr, name = "total_press_releases") %>%
  add_count(image_url, name = "total_images") %>%
  summarize(total_press_releases = max(total_press_releases),
            total_images = max(total_images)) %>%
  mutate(proportion = round(total_images/total_press_releases*100, 2)) %>%
  select(prov_terr, total_press_releases, total_images, proportion) %>%
  arrange(desc(proportion)) %>%
  rename(`Province/Territory` = prov_terr,
         `Total NRs` = total_press_releases,
         `Total NRs w/ Images` = total_images,
         `Percent NRs w/ Images` = proportion) %>%
  ungroup()

paged_table(table1)

These are very interesting results. But this is obviously much too large a corpus to qualitatively analyze, even if we only analyzed press releases that contained images. What we want to do now is sample the larger corpus, creating a subset that we can reasonably analyze. Although there are many ways one could do this, we are going to use a crude and simple methodology. This involves creating a dictionary of keywords, searching for the presence of these keywords in the text of each news release, including any news release that contains one or more of our keywords, and excluding all of those that contain none of them. Since we are interested in textual and visual representations of alleged drug crimes, we are going to create a dictionary of drug-related keywords. Each word in the string is separated by the "|" symbol, which stands for "or". This is necessary as it tells our string detection algorithm to search for the presence of any of the words in the dictionary, rather than the presence of all of them. Note that we are writing the words in lower case and without any pluralization. This is because…
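The dictionary itself (stored in my_pattern2 below) is not printed in this supplement. To illustrate the mechanics, here is a deliberately short, hypothetical version (the keywords are placeholders, not necessarily the ones used in the study), along with str_detect() applied to two invented snippets:

```r
library(stringr)

# hypothetical stand-in dictionary; the study's actual keyword list is longer
my_pattern2 <- "cocaine|fentanyl|methamphetamine|heroin|trafficking"

texts <- c(
  "police seized a quantity of fentanyl during the search",  # invented
  "rcmp remind drivers to slow down in school zones"         # invented
)

# TRUE wherever any one of the keywords appears
str_detect(texts, my_pattern2)
#> [1]  TRUE FALSE
```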

table2 <- rcmp_news_pp %>%
  filter(prov_terr != "British Columbia") %>%
  filter(!is.na(image_url),
         !is.na(full_text)) %>%
  group_by(prov_terr, image_url) %>%
  mutate(full_text = str_to_lower(full_text)) %>%
  count(tf = str_detect(full_text, my_pattern2)) %>%
  group_by(prov_terr, tf) %>%
  summarize(sum = sum(n)) %>%
  pivot_wider(names_from = tf, values_from = sum) %>%
  mutate(`FALSE` = `FALSE` + `TRUE`) %>%
  rename(`Total NRs w/ Images 2` = `TRUE`,
         `Total NRs 2` = `FALSE`,
         `Province/Territory` = prov_terr) %>%
  mutate(`Percent NRs w/ Images 2` = round(`Total NRs w/ Images 2`/`Total NRs 2`*100, 2)) %>%
  arrange(desc(`Percent NRs w/ Images 2`)) %>%
  ungroup()

paged_table(table2)

To inspect the results of these two tables more fully, we can put them into the same table. It can also be useful to represent the results of a table as a heatmap…

#install.packages("paletteer")
#install.packages("gt")
library(paletteer)
library(gt)
table3 <- table1 %>%
  left_join(table2, by = c("Province/Territory" = "Province/Territory")) %>%
  rename(` ` = "Province/Territory")

table3 %>%
  gt() %>%
  #tab_spanner(label = "",
  #            columns = c(`Total NRs`, `Total NRs w/ Images`, `Percent NRs w/ Images`)) %>%
  data_color(
    columns = c(`Total NRs`, `Total NRs w/ Images`, `Percent NRs w/ Images`),
    colors = scales::col_numeric(
      paletteer::paletteer_d(
        palette = "ggsci::red_material")
        %>% as.character(),
      domain = NULL
      )
  )

The subset corpus of 997 news releases (containing images) is a much more reasonably sized corpus to analyze. Although we may end up wanting to subset this corpus further, it is at the very least a good place to start. What we want to do now is write some code that will: 1) create a DataFrame that contains only the values of our subset corpus; and 2) download the image and text data for each of the news releases. We can use metadata from the DataFrame to automatically name each of the images and corresponding text files, and put them into the same folder on our computer.

#install.packages("splitstackshape")
library(splitstackshape)
image_meta_data <- rcmp_news_pp %>% 
  ungroup() %>%
  filter(prov_terr != "British Columbia") %>%
  filter(!is.na(image_url)) %>%
  #mutate(full_text = str_to_lower(full_text)) %>%
  #filter(str_detect(full_text, my_pattern)) %>%
  mutate(date_published = str_remove_all(pattern= " (\\d{2})(\\:)(\\d{2})(\\:)(\\d{2})", date_published)) %>%
  mutate(date_published = as.Date(lubridate::ymd(date_published))) %>%
  mutate(year = as.numeric(str_extract(date_published, "\\d{4}")))

#image_meta_data <- stratified(image_meta_data, c("prov_terr", "year"), .15)



image_urls <- image_meta_data %>% select(image_url)

image_urls1 <- slice(image_urls, 1:250)
image_meta_data1 <- slice(image_meta_data, 1:250)

image_urls2 <- slice(image_urls, 251:497)
image_meta_data2 <- slice(image_meta_data, 251:497)

image_urls3 <- slice(image_urls, 501:750)
image_meta_data3 <- slice(image_meta_data, 501:750)

image_urls4 <- slice(image_urls, 751:997)
image_meta_data4 <- slice(image_meta_data, 751:997)

Now let’s download the images.

options(timeout = 500)

# download one file per row, numbering the files to match each slice
for (i in seq_len(nrow(image_urls1))){
  
  download.file(image_urls1$image_url[i], paste("/Volumes/ajl_external/rcmp-news-images/rcmp-drug-crimes-stratified-random/", image_meta_data1$prov_terr[i], "_RCMP_", image_meta_data1$date_published[i], "_", i, ".jpg", sep = ""))
  
}

for (i in seq_len(nrow(image_urls2))){
  
  download.file(image_urls2$image_url[i], paste("/Volumes/ajl_external/rcmp-news-images/rcmp-drug-crimes-stratified-random/", image_meta_data2$prov_terr[i], "_RCMP_", image_meta_data2$date_published[i], "_", 250 + i, ".jpg", sep = ""))
  
}

for (i in seq_len(nrow(image_urls3))){
  
  download.file(image_urls3$image_url[i], paste("/Volumes/ajl_external/rcmp-news-images/rcmp-drug-crimes-corpus/", image_meta_data3$prov_terr[i], "_RCMP_", image_meta_data3$date_published[i], "_", 500 + i, ".jpg", sep = ""))
  
}

for (i in seq_len(nrow(image_urls4))){
  
  download.file(image_urls4$image_url[i], paste("/Volumes/ajl_external/rcmp-news-images/rcmp-drug-crimes-corpus/", image_meta_data4$prov_terr[i], "_RCMP_", image_meta_data4$date_published[i], "_", 750 + i, ".jpg", sep = ""))
}

Finally, let’s download the text files for each news release.

subset_corpus <- rcmp_news_pp %>% 
  ungroup() %>%
  filter(prov_terr != "British Columbia") %>%
  filter(!is.na(image_url)) %>%
  mutate(full_text_lower = str_to_lower(full_text)) %>%
  filter(str_detect(full_text_lower, my_pattern)) %>%
  select(-full_text_lower) %>%
  mutate(date_published = str_remove_all(pattern= " (\\d{2})(\\:)(\\d{2})(\\:)(\\d{2})", date_published)) %>%
  mutate(date_published = as.Date(lubridate::ymd(date_published))) %>%
  mutate(doc_id = row_number())

subset_corpus <- image_meta_data %>%
  mutate(date_published = str_remove_all(pattern= " (\\d{2})(\\:)(\\d{2})(\\:)(\\d{2})", date_published)) %>%
  mutate(date_published = as.Date(lubridate::ymd(date_published))) %>%
  mutate(doc_id = row_number()) %>%
  mutate(date_published2 = paste("\n\nDATE: ", as.character(date_published), "\n\n", sep = "")) %>%
  mutate(headline_text = paste("HEADLINE: ", headline_text, "\n\n", sep = "")) %>%
  mutate(headline_url = paste("ARTICLE URL: ", headline_url, "\n\n", sep = "")) %>%
  mutate(image_url = paste("IMAGE URL: ", image_url, "\n\n", sep = "")) %>%
  mutate(prov_terr2 = paste("CITY/TOWN, PROVINCE/TERRITORY: ", town_city_county, prov_terr, "\n\n", sep = "")) %>%
  mutate(full_text = paste("FULL TEXT: ", full_text, "\n\n", sep = "")) %>%
  mutate(text = paste(date_published2, headline_text, headline_url, image_url, prov_terr2, full_text)) %>%
  mutate(doc_names = paste(prov_terr, "_RCMP_", date_published, "_", doc_id, sep = ""))

setwd("/Volumes/ajl_external/rcmp-news-images/rcmp-drug-crimes-stratified-random/")

subset_corpus %>%
  select(doc_names, text) %>%
  group_by(doc_names) %>%
  do(write_csv(., paste0(unique(.$doc_names), ".txt", sep = ""), col_names = FALSE))
Folder containing the 997 images and text files from our subset corpus
Example text file and image in the folder

Analyzing

Findings, Discussion, and Conclusion